Figure 1: Map of TVCG based on 1,343 TVCG titles in DBLP, heatmap overlay based on 34 papers by the most prolific TVCG author. (Multi-Word
Term extraction, C-Value with Unigrams ranking, Partial Match Jaccard Coefficient similarity, Pull Lesser Terms filtering, number of
terms 1500.)



Abstract 

We describe a practical approach for visual exploration of research 
papers. Specifically, we use the titles of papers from the DBLP 
database to create what we call maps of computer science (MoCS). 
Words and phrases from the paper titles are the cities in the map, 
and countries are created based on word and phrase similarity,
calculated using co-occurrence. With the help of heatmaps, we can
visualize the profile of a particular conference or journal over the 
base map. Similarly, heatmap profiles can be made of individual 
researchers or groups such as a department. The visualization sys- 
tem also makes it possible to change the data used to generate the 
base map. For example, a specific journal or conference can be used 
to generate the base map and then the heatmap overlays can be used 
to show the evolution of research topics in the field over the years. 
As before, profiles of individual researchers or research groups can
be visualized using heatmap overlays but this time over the journal 
or conference base map. Finally, research papers or abstracts eas- 
ily generate visual abstracts giving a visual representation of the 
distribution of topics in the paper. We outline a modular and exten- 
sible system for term extraction using natural language processing 
techniques, and show the applicability of methods of information
retrieval to calculation of term similarity and creation of a topic
map. The system is available at mocs.cs.arizona.edu.

1 Introduction 

Providing efficient and effective data visualization is a difficult 
challenge in many real-world software systems. One challenge lies
in developing algorithmically efficient methods to visualize large 
and complex data sets. Another challenge is to develop effective 
visualizations that make the underlying patterns and trends easy to 
see. Even tougher is the challenge of providing interactive access, 
analysis, and filtering. All of these tasks become even more diffi- 
cult with the size of the data sets arising in modern applications. In 
this paper we describe maps of computer science (MoCS), a func- 
tional visualization system for a large relational data set, based on 
spatialization and map representations. 

Spatialization is the process of assigning 2D or 3D coordinates
to abstract data points, ideally in such a way that the spatial mapping
preserves many of the characteristics of the original (higher-dimensional)
space. Multi-dimensional scaling (MDS), principal component
analysis (PCA), and force-directed methods are among the
standard techniques that allow us to spatialize high-dimensional 
data. 

Map representations provide a way to visualize relational data 
with the help of conceptual maps as a data representation metaphor. 
Graphs are a standard way to visualize relational data, with the objects
defining vertices and the relationships defining edges. It requires
an additional step to get from graphs to maps: clusters of
well-connected vertices form countries, and countries share borders
when neighboring clusters are tightly interconnected.

Figure 2: The main steps of the MoCS system are querying documents from DBLP, extracting terms from these titles, ranking terms by
importance, calculating term similarity, further filtering terms based on similarity, and finally performing multidimensional scaling and
clustering to produce a basemap, over which a heatmap can be overlaid.

Traditional maps offer a natural way to present geographical data 
(continents, countries, states) and additional properties defined with 
the help of contours and overlays (topography, geology, rainfall). In 
the process of data mining and data analysis, clustering is a very im- 
portant step. It turns out that maps are very helpful in dealing with 
clustered data. There are several reasons why a map representation 
of clusters can be helpful. First, by explicitly defining the bound- 
ary of the clusters and coloring the regions, we make the clustering 
information clear. Second, as most dimensionality-reduction tech- 
niques lead to a two-dimensional positioning of the data points, a 
map is a natural generalization. Finally, while it often takes us con- 
siderable effort to understand graphs, charts, and tables, a map rep- 
resentation is intuitive, as most people are familiar with maps and 
map-based interactions such as pan and zoom. 

We describe a practical approach for visualizing data from the 
DBLP bibliography server [31]. Specifically, we use the titles of 
2,184,270 papers in the database to create what we call maps of
computer science (MoCS), where words and phrases from the ti- 
tles are the cities and where the countries are created based on co- 
occurrence. With the help of heatmap overlays, we can visualize 
the profile of a particular conference or journal over the base map. 
Similarly, individual researchers or groups such as a department 
can be used to generate heatmap profiles. The visualization system 
also makes it possible to change the data used to generate the base 
map. For example, a specific journal or conference can be used to 
generate the base map and then the heatmap overlays can be used to 
show the evolution of research topics in the field over the years with 
the help of small multiples. As before, profiles of individual researchers
or research groups can be visualized using heatmap overlays,
but this time over the journal or conference base map. Finally,
research papers or abstracts easily generate visual abstracts.

An overview of our MoCS system is in Figure 2 and our main
contributions are as follows. First, we describe a fully functional
visualization system, MoCS, which interactively generates base maps
of computer science from the DBLP bibliography server: from
maps based on all papers available in the database, to maps based 
on a particular journal or conference, to maps based on an individ- 
ual researcher. Second, our system allows us to visualize temporal 
heatmap overlays making it possible to visualize the evolution of 
the field, journals, and conferences over time. Third, our system 
allows us to visualize individual heatmap overlays making it pos- 
sible to visualize individual researchers in the field, or individual 
researchers in a particular conference, or individual papers in a particular
conference. Finally, the MoCS system is modular and extensible,
with complete source code available, thus making it easy to change
various components: from the various natural language processing 
steps, to the creation of the graph that models the topics, to the 
visualization of the results. 

2 Related Work 

Using maps to visualize non-cartographic data has been considered 
in the context of spatialization by Skupin and Fabrikant [41] and 
Fabrikant et al. [15]. Map-like visualization using layers and terrains
to represent text document corpora dates back at least to the 1995
approach of Wise et al. [47]. Cortese et al. [10] also use a topographical
map metaphor to visualize prefix propagation in the Internet,
where contour lines describing the propagation are calculated using
a force-directed algorithm. The problem of effectively conveying
change over time using a map-based visualization was studied by 
Harrower [22]. Also related is work on visualizing subsets of a set 
of items using geometric regions to indicate the grouping. Byelas 
and Telea [6] use deformed convex hulls to highlight areas of interest
in UML diagrams. Collins et al. [8] use "bubblesets," based
on isocontours, to depict multiple relations among a set of objects. 
Simonetto et al. [40] automatically generate Euler diagrams which 
provide one of the standard ways, along with Venn diagrams, for 
visualizing subset relationships. 

GMap uses the geographic map metaphor for visualizing relational
data and was proposed in the context of visualizing recommendations,
where the underlying data is TV shows and the similarity
between them [20, 24]. This approach combines graph layout
and graph clustering, together with appropriate coloring of the clusters
and creating countries based on clusters and connectivity in the
original graph. A comprehensive overview of graph-based representations
by von Landesberger et al. [45] considers visual graph
representation, interaction, editing, and algorithmic analysis.

Word clouds and tag clouds have been in use for many years [38,
43]. The popular tool Wordle [44] took word clouds to the next
level with high quality design, graphics, style and functionality.
While these early approaches do not explicitly use semantic infor- 
mation such as word relatedness in placing the words in the cloud, 
several more recent approaches do. Koh et al. [28] use interaction 
to add semantic relationships in their ManiWordle approach. Parallel
tag clouds by Collins et al. [9] are used to visualize evolution
over time with the help of parallel coordinates. Cui et al. [11] couple
trend charts with word clouds to keep semantic relationships,
while visualizing evolution over time with the help of force-directed
methods. Wu et al. [48] introduce a method for creating semantic-preserving
word clouds based on a seam-carving image processing
method and an application of bubble sets. Paulovich et al. [37] combine
semantic proximity with techniques for fitting word clouds inside
general polygons. They apply this technique to a collection of
documents and obtain several word clouds of related terms, while 
optimizing word packing into polygons with semantic preservation. 
Hierarchically clustered document collections have been the do- 
main of many visualizations based on self-organizing maps [30], 
Voronoi diagrams [2], and Voronoi treemaps [36]. Of course, clas- 
sical treemaps [39] and their variants are also often used to visualize 
text collections. 

There is a great deal of related work on natural language pro- 
cessing, text summarization, topic extraction and associated visu- 
alizations. Statistical topic modeling relies on machine learning 
techniques to extract semantic or thematic topics from a text collection,
e.g., via Latent Semantic Analysis [12] or Latent Dirichlet
Allocation [4]. Extensions to these topic models allow discovery of
topics underlying multi-word phrases [46] and the use of additional
syntactic structure, such as sentence parse trees, to aid inference
of topics [5]. The topics provide an abstract representation of the
text collection and are used for searching and categorization. For 
example, Grouper [49] presents search results as sets of documents 
clustered by common phrases. TopicNets [21] assigns the top two 
words as a summary of the underlying text. TopicIslands [34] is
one of the early visualizations, based on wavelets. More recently,
FacetAtlas [7] uses similarity between documents to create a graph
which can be used to visually explore the data. PhraseNets supports 
search for user provided word-pairs which are then used to create 
graph-based visualization of text [42]. TagRiver [16] uses word 
clouds to visualize temporal changes in semantic data. The TIARA 
system [32] uses text summarization techniques and ThemeRiver- 
style visualization [23] to summarize large text collections. 

3 Maps of Computer Science 

Here we describe the main steps in the system: natural language
processing (term extraction, term ranking, term filtering, similarity
matrix) and graph and map generation (distance matrix, embedding,
clustering, coloring).

3.1 Term Extraction 

In the first step of map creation, multi-word terms are extracted 
from the titles of papers in DBLP. Part-of-speech (POS) tags are
used to choose words that constitute topically meaningful terms, 
and exclude functional words (words that convey little semantic 
meaning, such as "the", "and", and "a"). The Natural Language 
Toolkit (NLTK) POS tagger [3] is used to label the words in all ti- 
tles with POS tags. Before running the tagger, titles are converted 
to lowercase, since the tagger is case-sensitive and more likely to
incorrectly label capitalized words as proper nouns.

Figure 3: Section of a multi-word term map, built from 1,343
TVCG paper titles using the C-Value with Unigrams ranking, Partial
Match Jaccard Coefficient similarity, and Pull Lesser Terms filtering
functions, with the number of terms parameter set to 1500.

Figure 4: Section of a single-word term map, built from 1,343
TVCG paper titles using the TF ranking, LSA similarity, and Pull
Lesser Terms filtering functions, with the number of terms parameter
set to 1800.

Once a title is tagged, maximal subsequences of words with POS tags matching
the following regular expression are extracted from titles:

((JJ)|(JJR)|(JJS)|(NN)|(NNS)|(NNP)|(NNPS))*

JJ, JJR, and JJS are tags representing normal adjectives, comparative
adjectives, and superlative adjectives, respectively, while NN,
NNS, NNP, and NNPS are nouns, plural nouns, proper nouns, and 
proper plural nouns, respectively. This regular expression was cho- 
sen to extract a subset of noun and adjectival phrases including 
modifiers such as noun adjuncts and attributive adjectives. For ex- 
ample, the paper title "Interactive Support for Non-Programmers: 
The Relational and Network Approaches" is assigned the tag sequence
JJ NN IN NNS DT JJ CC NN NNS. The subsequences
JJ NN, NNS, JJ, and NN NNS are matched, and their corresponding
word sequences "interactive support", "non-programmers", "relational",
and "network approaches" are extracted as terms. Maps
can be created with these multi-word terms (Fig. 3), or the terms
can be broken up into their constituent words (Fig. 4) to parallel the
word-based visual representations of systems such as Wordle [44].
Maps and visualizations made from single words can display 
broad associations between words, as demonstrated in the semantic 
word clouds of Wu et al. [48]. Multi-word terms can provide a fine- 
grained view of the topics represented in the database of paper titles. 
For example, using single-word terms extracted from the titles of 
40,000 randomly sampled DBLP papers, and the Latent Semantic
Analysis similarity function (described below), the 5 most similar
terms to "network" are "neural", "wireless", "sensor", "analysis",
and "model". This list of similar terms helps reveal that there are 
different types of networks. Using multi-word terms and the C- 
value With Unigrams ranking function (described below), we find 
that the terms "neural network" and "wireless sensor network" ap- 
peared frequently in titles, and are both ranked in the top 1500 terms 
from this document set, helping to explain why "neural", "wire- 
less", and "sensor" were highly associated with "network" in the 
single-word version. We can use multi-word terms and similarity 
to investigate what topics are closely related to each specific type of 
network. Using the Jaccard similarity function (described below), 
the 4 terms ranked as most similar to "neural network" are "predic- 
tions", "genetic algorithm", "dynamics", and "combinatorial op- 
timization problem", while the 4 most similar terms to "wireless 
sensor network" are "reinforcement", "mobile robot", "modeling", 
and "energy", showing disparate applications and related topics for 
the two different types of networks. 
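
The extraction step described in this section can be sketched as follows. This is a minimal illustration rather than the MoCS implementation: it assumes each title has already been lowercased and POS-tagged (for instance by the NLTK tagger mentioned above), and the name `extract_terms` is hypothetical.

```python
# Tags accepted by the noun and adjectival phrase pattern of Section 3.1.
TERM_TAGS = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}


def extract_terms(tagged_title):
    """Return maximal runs of adjective/noun-tagged words as terms.

    tagged_title is a list of (word, tag) pairs for one title.
    """
    terms, current = [], []
    for word, tag in tagged_title:
        if tag in TERM_TAGS:
            current.append(word)          # extend the current run
        elif current:
            terms.append(" ".join(current))  # a non-matching tag ends the run
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


# The example title from the text, with the tag sequence given there.
tagged = [
    ("interactive", "JJ"), ("support", "NN"), ("for", "IN"),
    ("non-programmers", "NNS"), ("the", "DT"), ("relational", "JJ"),
    ("and", "CC"), ("network", "NN"), ("approaches", "NNS"),
]
print(extract_terms(tagged))
```

Splitting each returned term on whitespace gives the single-word variant used for maps such as the one in Figure 4.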

3.2 Term Ranking 

Once multi-word or single-word terms are extracted, they can be
assigned importance scores, or weights, based on their usage in the 
corpus of titles. Terms are then ordered by their weights to produce 
a ranking of terms, of which the top terms can be selected for in- 
clusion in the visual map representation. We implement four such 
ranking functions in the MoCS system: Term Frequency, Term Fre- 
quency/Inverse Comparison Frequency, C-Value, and C-Value with 
Unigrams. 

Under the term frequency ranking function, each term's weight 
is the number of times it occurred within the corpus. Term fre- 
quency tends to highly weight functional words such as determin- 
ers and conjunctions: words that appear frequently but convey little 
meaning such as "the", "a". In our system, many of these func- 
tional words are already excluded by the term extraction step, if 
their POS tags do not match the noun and adjectival phrase extrac- 
tion expression. However, we still want to provide the option to 
exclude common phrases that convey little semantic meaning, such 
as "introduction" (which occurs 9th in a list of multi-word terms 
ordered by frequency from a 1,000,000 title sample of DBLP, oc- 
curring 618 times). To accomplish this, a standard modification 
to term-frequency is term frequency-inverse document frequency 
(TF/IDF), where a term's weight in a text collection is proportional 
to its frequency in the document and inversely proportional to the 
number of other documents it appears in. In our domain, consisting 
of many short documents (titles), terms usually only occur once in 
each document, so the inverse document frequency of a term is almost
always 1. Therefore, we further adapt TF/IDF to this corpus
by treating the entire collection of titles as a single document, and 
counting the term's frequency in a reference corpus from a differ- 
ent domain to use as the inverse weighting value. We refer to the 
resulting method as term frequency-inverse comparison frequency 
(TF/ICF). A term's weight under TF/ICF is the number of times the 
term appeared in the corpus of documents (the target corpus), divided
by the number of times that term appeared in a disparate corpus of 
text from a different domain (the comparison corpus): 



$$ \mathrm{weight}(t) = \frac{\mathrm{Target}(t)}{\mathrm{Comp}(t)} $$


In the above equation, t is a term, Target(t) is the count of times that
t appeared in the target corpus (DBLP titles) as a complete term,
and Comp(t) is the count of times that t appeared in the comparison
corpus. The MoCS system currently uses the Brown Corpus [17], a
selection of English text drawn from newspapers, fiction, and other 
wide-distribution literature, as the comparison corpus. 
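
As a rough illustration of TF/ICF, the sketch below computes Target(t) / Comp(t) from two dictionaries of counts. The text does not say how terms missing from the comparison corpus are handled, so the `smoothing` parameter (added to Comp(t) to avoid division by zero) is an assumption of this sketch, as are the function names.

```python
def tf_icf(target_counts, comparison_counts, smoothing=1):
    """Weight each term by Target(t) / (Comp(t) + smoothing).

    smoothing is an assumption of this sketch: it keeps the ratio
    defined for terms that never occur in the comparison corpus.
    """
    return {
        term: count / (comparison_counts.get(term, 0) + smoothing)
        for term, count in target_counts.items()
    }


def top_terms(weights, n):
    """Return the n highest-weighted terms (ties broken alphabetically)."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:n]


# Toy counts, invented for illustration only.
target = {"introduction": 10, "neural network": 4}
comparison = {"introduction": 4}
weights = tf_icf(target, comparison)
print(top_terms(weights, 1))
```

With these toy counts the domain-specific term outranks the generic one even though it is less frequent in the target corpus, which is the behavior the comparison corpus is meant to provide.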

C-value [18] is specifically designed for multi-word term ranking,
accounting for possible nesting of multi-word terms (where
short terms appear as word subsequences of longer terms). C-value
incorporates total frequency of occurrence, frequency of occurrences
of the term within other longer terms, the number of types
of these longer terms, and the number of words in the term. The 
weight assigned by C-value is proportional to the logarithm of the 
number of words in a term, so we also include a modified imple- 
mentation, C-value With Unigrams, that adds one to this length be- 
fore taking the logarithm. This modification allows single word 
terms to be assigned non-zero weight and be included in the set of 
top terms. 
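
The effect of the unigram modification can be seen in a simplified sketch of the C-value formula. The version below uses the commonly stated form, log2 of the term length multiplied by the term frequency minus the mean frequency of the longer candidate terms that contain it. It is not the MoCS code, it skips the incremental bookkeeping of the full algorithm, and the names `contains` and `c_value` are hypothetical.

```python
import math


def contains(longer, shorter):
    """True if shorter is a contiguous word subsequence of a strictly longer term."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq, with_unigrams=False):
    """Score terms given freq: {tuple_of_words: total corpus frequency}."""
    scores = {}
    for term, f in freq.items():
        # C-value With Unigrams adds one to the length before the logarithm.
        length = len(term) + (1 if with_unigrams else 0)
        # Frequencies of the longer candidate terms that nest this term.
        nesting = [g for other, g in freq.items() if contains(other, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(length) * f
    return scores


# Toy frequencies, invented for illustration only.
freq = {
    ("network",): 10,
    ("neural", "network"): 4,
    ("wireless", "sensor", "network"): 2,
}
plain = c_value(freq)
modified = c_value(freq, with_unigrams=True)
print(plain[("network",)], modified[("network",)])
```

Under plain C-value the single-word term always scores zero because log2(1) = 0; with the added one it receives a positive weight and can compete for a place among the top terms.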

After terms are assigned importance weights, they are sorted in 
order of descending weight, and the top N terms are selected for 
possible inclusion in the map. N (Number of Terms) is a config- 
urable parameter passed to the MoCS system. Larger values of N 
produce maps that include terms ranked lower by the chosen rank- 
ing algorithm, i.e., words with lower weighted term frequency in 
the set of titles queried. 

3.3 Similarity Matrix Computation 



Once a set of top terms is selected, pairwise similarity values be- 
tween top terms are calculated. We seek similarity functions that 
measure how closely the topics represented by two terms are re- 
lated. Terms that refer to the same or similar topic, or topics that 
are closely associated, should receive high similarity values. We 
use term-document co-occurrence as the basis of these similarity 
values, assuming that terms that appear together in multiple docu- 
ments (paper titles) are more likely to be related in meaning. 

The similarity functions take a term-document matrix, M, as in- 
put. The columns of M correspond to titles of papers from DBLP, 
and rows correspond to terms extracted by the term-extraction step. 
The entries in the matrix are calculated as 

$$ M_{ij} = \mathrm{occurrences}_j(\mathrm{term}_i) $$

where $\mathrm{occurrences}_j(\mathrm{term}_i)$ is the number of times the term indexed
by i appeared in the document indexed by j. We implement three 
similarity functions in the MoCS system: Latent Semantic Analysis, 
Jaccard Coefficient, and Partial Match Jaccard coefficient. 

Latent Semantic Analysis (LSA), described by Deerwester et 
al. [12], is a method of extracting underlying semantic representation
from the term-document matrix, M. A low-rank approximation
to the term-document matrix is used to calculate the distance
between terms in a vector-space representation reflecting meaning
in topical space. The singular value decomposition

$$ M = U \Sigma V^T $$

is calculated using sparse-matrix methods. Rows in the product $U\Sigma$
represent terms as feature vectors in the high-dimensional seman- 
tic space. Terms are compared using cosine similarity [33] of the 
feature vectors to produce a matrix of pairwise similarities between 
terms. The cosine similarity of two term vectors $v_i, v_j$ is calculated
as

$$ \cos(\theta) = \frac{v_i \cdot v_j}{\|v_i\| \, \|v_j\|} $$

The value returned by this function is bounded between 0, indicat- 
ing a maximal angle between the term vectors in semantic space and 
no similarity between the terms, and 1, indicating the term vectors, 
measuring decomposed co-occurrence, are identical. 
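
The term-document matrix and the cosine comparison can be illustrated with a small dependency-free sketch. Note the simplification: LSA compares rows of the decomposed matrix, whereas the sketch below omits the SVD and applies cosine similarity directly to rows of M, purely to show the two building blocks. The function names and the toy documents are assumptions.

```python
import math


def term_document_matrix(terms, documents):
    """M[i][j] = number of times terms[i] occurs in documents[j].

    Each document is the list of terms extracted from one title.
    """
    return [[doc.count(term) for doc in documents] for term in terms]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy titles, already reduced to their extracted terms.
documents = [
    ["neural network", "genetic algorithm"],
    ["neural network", "dynamics"],
    ["genetic algorithm"],
]
terms = ["neural network", "genetic algorithm", "dynamics"]
M = term_document_matrix(terms, documents)
print(M)
print(cosine(M[0], M[1]))
```

In a full LSA pipeline the rows passed to `cosine` would instead be the term vectors obtained from a truncated SVD of M.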

LSA is a standard approach to calculating term and document 
similarity in information retrieval. However, as in the term ranking 
stage, terms rarely occur more than once in a single document (par- 
ticularly if they are multi-word terms). In our case, the entries in 
the term-document matrix are effectively boolean. Depending on 
the term-ranking algorithm used to select the most important terms, 
the term-document matrix can also be quite sparse. 



We provide the Jaccard coefficient [25] as an alternative similarity
function to accommodate the nearly boolean nature of the term- 
document matrix. Jaccard calculates pairwise term similarity as the 
number of documents two terms appeared together in, divided by 
the number of documents either term appeared in: 



$$ \mathrm{Jacc}(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|} $$



where $S_i$ and $S_j$ are the sets of documents that the two terms being
compared appeared in. Like LSA, Jaccard Coefficient produces a
value between 0, indicating terms did not appear together in any
documents and have no similarity, and 1, indicating terms never appeared
separately and have maximal similarity. Jaccard coefficient
alone treats terms as atomic units: multi-word terms only match if 
they are identical. This approach produces very sparse similarity 
matrices when used with a ranking algorithm such as C-value that 
prioritizes multi-word terms. 
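
A minimal sketch of the Jaccard similarity over document sets follows. The helper names are hypothetical, and returning 0.0 for two terms that occur in no documents is a choice made here, not something the text specifies.

```python
def document_sets(terms, documents):
    """Map each term to the set of indices of documents that contain it exactly."""
    return {
        term: {j for j, doc in enumerate(documents) if term in doc}
        for term in terms
    }


def jaccard(s_i, s_j):
    """|S_i intersect S_j| / |S_i union S_j|, or 0.0 when both sets are empty."""
    union = s_i | s_j
    return len(s_i & s_j) / len(union) if union else 0.0


# Toy titles, already reduced to their extracted terms.
documents = [
    ["neural network", "genetic algorithm"],
    ["neural network", "dynamics"],
    ["genetic algorithm"],
]
sets = document_sets(["neural network", "genetic algorithm", "dynamics"], documents)
print(jaccard(sets["neural network"], sets["dynamics"]))
```

Because only set membership matters, this measure is unaffected by the nearly boolean entries of the term-document matrix.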

Partial Match Jaccard Coefficient attempts to address the sparsity
of the C-value matrices by treating two terms as identical for
the purpose of co-occurrence calculation if they contain a common 
subsequence of words. For example, if "partial match jaccard co- 
efficient" and "similarity" both occurred as multi-word terms in a 
paper title, and "similarity" and "jaccard coefficient" were present 
in our list of top-terms but "partial match jaccard coefficient" was 
not, this function would count a co-occurrence between "similarity" 
and "jaccard coefficient" because the top term is a subsequence of 
the longer term found in the title. 
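
The example above can be reproduced with a short sketch. It implements only the direction used in that example, where a top term counts as present in a title if it is a contiguous word subsequence of some term extracted from the title. The function names are hypothetical and this is not the MoCS implementation.

```python
def is_word_subsequence(short, long):
    """True if the words of short occur contiguously within long (or the two are equal)."""
    s, l = short.split(), long.split()
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def partial_match_sets(top_terms, documents):
    """Map each top term to the documents holding a term that contains it."""
    return {
        term: {
            j for j, doc in enumerate(documents)
            if any(is_word_subsequence(term, found) for found in doc)
        }
        for term in top_terms
    }


def jaccard(s_i, s_j):
    """|S_i intersect S_j| / |S_i union S_j|, or 0.0 when both sets are empty."""
    union = s_i | s_j
    return len(s_i & s_j) / len(union) if union else 0.0


# One title whose extracted terms are those of the example in the text.
documents = [["partial match jaccard coefficient", "similarity"]]
sets = partial_match_sets(["similarity", "jaccard coefficient"], documents)
print(jaccard(sets["similarity"], sets["jaccard coefficient"]))
```

With exact matching, "jaccard coefficient" would appear in no documents and the pair would receive similarity 0; with partial matching the co-occurrence is counted.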

3.4 Term Filtering and Distance Calculation 

Term similarities have been calculated between the N highest 
ranked terms in the previous step. The next stage in the pipeline 
is filtering, choosing the terms to include in the map. We imple- 
ment two filtering methods in the MoCS system: Top Terms and 
Pull Lesser Terms. 

Top Terms is the simplest type of filtering, where we take the 
top-ranked K terms from the N highest ranked terms (K < N). The
default for K in our current system is 150. In practice, sparsity of 
data causes this method to produce fragmented maps, as the top 
K terms often have low similarity to other top terms (particularly 
when the multi-word term-extraction system is used). 

Pull Lesser Terms attempts to address the fragmentation of the
Top Terms method by mapping not only the highest ranked terms but
also lesser-ranked terms that are similar to a top-ranked
term. Specifically, this method takes as input the N highest ranked
terms, $\mathrm{terms}_N$, and their pairwise similarities, as calculated in the
ranking and similarity steps of the pipeline. The method plots the K
highest ranked terms, $\mathrm{terms}_K$, from among $\mathrm{terms}_N$, and the $l$ most
similar terms from $\mathrm{terms}_N$ for each term in $\mathrm{terms}_K$. These $l$ most
similar terms are plotted regardless of whether they are members of
the set $\mathrm{terms}_K$. Effectively, this method pulls in terms beyond the
top K, if they are more similar to a top term than any of the other
top terms. The default parameter values for K and $l$ in our current
system are K = 90, $l$ = 8.
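
The Pull Lesser Terms selection can be sketched as below. The text does not state whether terms with zero similarity to a top term may be pulled in, so skipping them is an assumption of this sketch. The function and parameter names are hypothetical, with `num_top` playing the role of K and `num_pulled` the role of the per-term neighbor count.

```python
def pull_lesser_terms(ranked_terms, similarity, num_top=90, num_pulled=8):
    """Select the top terms plus each one's most similar terms.

    ranked_terms: the N highest ranked terms, best first.
    similarity:   function (term_a, term_b) -> value in [0, 1].
    """
    top = ranked_terms[:num_top]
    selected = list(top)
    seen = set(top)
    for term in top:
        # Candidates come from all N terms, not only from the top ones.
        candidates = [u for u in ranked_terms
                      if u != term and similarity(term, u) > 0]
        candidates.sort(key=lambda u: similarity(term, u), reverse=True)
        for u in candidates[:num_pulled]:
            if u not in seen:
                seen.add(u)
                selected.append(u)
    return selected


# Toy similarities, invented for illustration only.
table = {("a", "d"): 0.9, ("a", "b"): 0.1, ("b", "e"): 0.5, ("b", "c"): 0.4}


def sim(x, y):
    return table.get((x, y), table.get((y, x), 0.0))


print(pull_lesser_terms(["a", "b", "c", "d", "e"], sim, num_top=2, num_pulled=1))
```

In this toy run the lower-ranked terms "d" and "e" are pulled into the map because each is the closest neighbor of a top term, while "c" is left out.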

The pairwise term similarity matrix is next converted into a matrix
of distances for use by the multi-dimensional scaling or force-directed
algorithms of GMap. Let $S(t_i, t_j) \in [0, 1]$ be the similarity
between two terms, calculated using either LSA, Jaccard Coefficient,
or Partial Match Jaccard Coefficient. Some choices of document
sets and ranking and similarity functions produce terms with
a similarity distribution narrower than the theoretical range of
the similarity function, so rescaled similarity values are calculated
as

$$ \hat{S}(t_i, t_j) = \frac{S(t_i, t_j)}{\max_{m,n} S(t_m, t_n)} $$

The distance between these two terms, $D(t_i, t_j)$, is calculated using
these rescaled similarity values as

$$ D(t_i, t_j) = -\log\left[(1 - \sigma) \cdot \hat{S}(t_i, t_j) + \sigma\right], $$

where $\sigma$ is a small, positive, constant scaling value, currently set
to 0.1, used to ensure a non-zero value inside the logarithm in the
case that two terms have a pairwise similarity of 0. Linear transformations
of similarities into distances produced maps that looked
dense, crowded, and highly fragmented. A logarithmic scale allows
comparison of relative distance between terms with low pairwise
similarity by magnifying the distances between these terms. This
produces less crowded maps, since most term pairs have low
pairwise similarity compared to the highest similarity pair of terms
in the map (which is used in the normalization).
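
The similarity-to-distance conversion can be sketched as follows, under two assumptions that the text leaves open: the rescaling divides each similarity by the largest similarity in the map, and the logarithm is the natural logarithm. The function name is hypothetical and the default constant of 0.1 is the value quoted above.

```python
import math


def similarity_to_distance(similarity, sigma=0.1):
    """Convert {(t_i, t_j): similarity in [0, 1]} to logarithmic distances.

    Similarities are first rescaled by the largest value (assumed
    normalization), then mapped through -log((1 - sigma) * s + sigma).
    """
    s_max = max(similarity.values())
    distances = {}
    for pair, s in similarity.items():
        s_hat = s / s_max if s_max > 0 else 0.0
        distances[pair] = -math.log((1 - sigma) * s_hat + sigma)
    return distances


# Toy similarities, invented for illustration only.
sims = {("a", "b"): 0.4, ("a", "c"): 0.0, ("b", "c"): 0.2}
for pair, d in similarity_to_distance(sims).items():
    print(pair, round(d, 3))
```

The most similar pair ends up at distance 0 and a pair with zero similarity at the finite maximum of -log(sigma), so the constant bounds how far apart unrelated terms can be pushed.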

3.5 Map Generation 

We begin with a summary of the GMap algorithm for generating 
maps from static graphs [24]. The input to the algorithm is a set 
of terms and pairwise similarities between these terms, from which 
an undirected graph G = (V,E) is extracted. The set of vertices V 
corresponds to the terms extracted from titles and the set of edges 
E corresponds to the top pairwise similarities between these terms 
as determined by the chosen filtering algorithm. In its full generality,
the graph is vertex-weighted and edge-weighted, with vertex
weights corresponding to some notion of the importance of a ver- 
tex and edge weights corresponding to some notion of the closeness 
between a pair of vertices. In the MoCS system, the relative frequen- 
cies of terms are used to determine the font size of the node label, 
using a linear scale with the minimum frequency term producing 
the smallest label and the maximum frequency term producing the 
largest label. The weight of an edge can be defined by the strength 
of the similarity between a pair of words or terms, and these edges 
can be marked in the base map of terms. 
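
The linear label scaling can be illustrated with a few lines. The point sizes of 8 and 24 are placeholder values chosen for this sketch, not the values used by MoCS, and the function name is hypothetical.

```python
def font_sizes(frequencies, min_size=8.0, max_size=24.0):
    """Linearly map term frequencies onto [min_size, max_size].

    The least frequent term gets min_size and the most frequent max_size;
    if all frequencies are equal, every term gets min_size.
    """
    lo, hi = min(frequencies.values()), max(frequencies.values())
    span = hi - lo
    return {
        term: min_size if span == 0
        else min_size + (f - lo) / span * (max_size - min_size)
        for term, f in frequencies.items()
    }


print(font_sizes({"graph": 5, "map": 3, "heatmap": 1}))
```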

In the first step of GMap the graph is embedded in the plane 
using a scalable force-directed algorithm [19] or multidimensional
scaling (MDS) [29]. In the second step, a cluster analysis is per- 
formed in order to group vertices into clusters, using a modularity- 
based clustering algorithm [35]. 

We use information from the clustering to guide the MDS-based
layout. In the third step of GMap, the geographic map correspond- 
ing to the data set is created, based on a modified Voronoi diagram 
of the vertices, which in turn is determined by the embedding and 
clustering. Here "countries" are created from clusters, and "conti- 
nents" and "islands" are created from groups of neighboring coun- 
tries. Borders between countries and at the periphery of continents 
and islands are created in fractal-like fashion. Finally, colors are as- 
signed with the goal that no two adjacent countries have colors that 
are too similar. In the context of visualizing dynamic data where the
relative change in popularity is important, we also use a heatmap
overlay to highlight the "hot" regions. Further geographic compo- 
nents can be added to strengthen the map metaphor. For instance, 
edges can be made semi-transparent or even modified to resemble 
road networks. In places where there are large empty spaces be- 
tween vertices in neighboring clusters, lakes, rivers, or mountains 
can be added, in order to emphasize the separation. 

3.6 Heatmap Overlays 

To visualize the profile of a target query set of papers (for exam- 
ple, papers from a specified time range, author, conference, or jour- 
nal) over a map, we use heatmap overlays. Heatmaps highlight the 
terms in the basemap that also occur in the target query, with color 
intensity proportional to the frequency of the term's occurrence in 
the heatmap query. Separate database queries are issued for the papers
used to produce the basemap and heatmaps (Fig. 2), allowing a
subset of the papers chosen for the basemap to be used as the target
query. For example, a basemap can be constructed from a sample
of all available papers, and a heatmap constructed from all papers
for a particular journal (Fig. 6d), or a heatmap of a single author
can be overlaid on a basemap of papers from a journal that author
frequently publishes in (Fig. 1).

Whichever type of term is chosen for the basemap (multi-word
or single-word) is also used to construct the heatmap overlay. If
$\mathrm{terms}_B$ is the set of terms found in the query used to produce the
basemap, and $\mathrm{terms}_T$ is the set of terms from the documents to be
visualized in the heatmap, the terms highlighted in the heatmap are
the intersection of these two groups, $\mathrm{terms}_H = \mathrm{terms}_B \cap \mathrm{terms}_T$.
The heatmap intensity, $I(t)$, of each term $t$ in $\mathrm{terms}_H$ is the number
of times $t$ appeared in documents in the target query. These intensities
are transformed on a logarithmic scale to allow terms with
low $I$ values to be visible in the heatmap, and then normalized so
that the most frequently appearing term has intensity 1. The final
normalized and rescaled intensity value, $\hat{I}(t)$, is

$$ \hat{I}(t) = \frac{\log(I(t) + \beta)}{\max_{t' \in \mathrm{terms}_H} \log(I(t') + \beta)} $$

where $\beta$ is a small additive constant (currently set to 1) that ensures
terms that only appeared once in the heatmap query still receive a
positive $\hat{I}(t)$ value.
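
A compact sketch of the intensity computation follows, assuming the normalization divides by the largest log-transformed count so that the most frequent term receives intensity 1. The function name is hypothetical, and the additive constant defaults to the value of 1 quoted above.

```python
import math
from collections import Counter


def heatmap_intensities(basemap_terms, target_documents, beta=1.0):
    """Return normalized log intensities for terms shared with the basemap.

    basemap_terms:    set of terms plotted in the basemap.
    target_documents: list of term lists, one per document in the heatmap query.
    """
    counts = Counter(term for doc in target_documents for term in doc)
    # Only terms present in both the basemap and the target query are highlighted.
    shared = {t: counts[t] for t in basemap_terms if counts[t] > 0}
    if not shared:
        return {}
    peak = max(math.log(c + beta) for c in shared.values())
    return {t: math.log(c + beta) / peak for t, c in shared.items()}


# Toy data, invented for illustration only.
basemap = {"graph", "map", "tree"}
target = [["graph", "map"], ["graph"], ["graph", "cloud"]]
print(heatmap_intensities(basemap, target))
```

Here "cloud" is ignored because it is not in the basemap and "tree" because it does not occur in the target query; a term seen once still receives a clearly visible intensity relative to a term seen three times.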

Basemaps are rendered in the browser as vector graphics, and 
heatmaps are drawn as a semi-transparent raster overlay using the 
OpenLayers heatmap implementation. This implementation uses 
a radial gradient centered at terms with a defined intensity value, 
where the color intensity at the center of the term is proportional to
the intensity value for the term. Currently, the radius of diffusion for the
radial gradient is constant across all terms in a map and chosen to
correspond to roughly half the average distance between terms. To
ensure that each term in $\mathrm{terms}_H$ has an overlay that exactly covers
its visual area in the basemap, this method might be improved
by making the diffusion radius for each term a function of the dis- 
tance to the closest term in the map. Alternatively, Inverse Distance 
Weighting could be used to calculate color intensities for all points 
over the basemap based on the heatmap intensity values of all terms. 
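
To make the Inverse Distance Weighting alternative concrete, the
following sketch evaluates a Shepard-style weighted average at a
single point. It is not part of the current system; the function
name idw_intensity, the power parameter, and the dictionary layout
for term positions and intensities are assumptions chosen for
illustration.

```python
import math


def idw_intensity(point, term_positions, term_intensities, power=2.0):
    """Inverse-distance-weighted intensity at a point on the basemap.

    point: (x, y) location to color.
    term_positions: dict mapping term -> (x, y) position on the map.
    term_intensities: dict mapping term -> heatmap intensity in [0, 1].
    power: exponent applied to distance; larger values make the
           nearest terms dominate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for term, (x, y) in term_positions.items():
        distance = math.hypot(point[0] - x, point[1] - y)
        if distance == 0.0:
            # The point coincides with a term: use its value directly.
            return term_intensities[term]
        weight = 1.0 / distance ** power
        weighted_sum += weight * term_intensities[term]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0
```

Evaluating this function on a regular grid over the basemap would
give a raster that varies smoothly between terms, instead of a
fixed-radius gradient around each term.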

4 DBLP Visualization 

4.1 Individual Heatmap Overlays 

The MoCS system allows separate database queries for the docu- 
ments used to produce the basemap and the documents used to pro- 
duce the heatmap overlay. Using the author information in DBLP, 
we can produce heatmap overlays of individual researchers over 
conferences and journals that they frequently publish in. Figure [5] 
shows a basemap constructed from titles of all papers published at 
the Conference on Neural Information Processing Systems (NIPS), 
with a heatmap constructed from the titles of papers by the most 
prolific author at NIPS. We see activity throughout the basemap, 
with particular intensity over a section of terms referring to infer- 
ence in graphical models. 

4.2 Conference and Journal Overlays 

The bibliographic information stored in DBLP allows us to plot 
heatmaps of specific conferences and journals over a basemap of 
all documents. Fig. [6] shows heatmaps of papers from four venues: 
the Computer Vision and Pattern Recognition conference (CVPR), 
the Symposium on Theory of Computing (STOC), the International 
Conference on Web Services (ICWS), and Transactions on Visu- 
alization and Computer Graphics (TVCG). These heatmaps are 
plotted from all available paper titles in the DBLP database for 
each venue. The basemap over which the heatmaps are plotted is 
made from 70,000 paper titles sampled uniformly from all entries in 
DBLP. Some similarities can be seen between the venues: all share 
relatively high intensity in their heatmaps over the terms
"application", "analysis", "method", and "evaluation". Some notable
topical differences between venues also stand out. CVPR has a
high-intensity region in the northwest corner of the map over terms
such as "images", "objects", and "recognition", while STOC has its
highest intensity in the northeast corner of the map, over terms
related to "graphs" and "complexity". ICWS has high intensity in the
south of the map over the terms "web services" and "systems", while
TVCG is literally all over the map, as visualization is associated
with all areas of computing: from visualization of algorithms to
algorithms for visualization, from design and analysis to
applications and systems.

Figure 5: A heatmap produced from 75 papers by the author who
has published most frequently at NIPS, over a basemap made from
multi-word terms extracted from titles of 3,553 NIPS papers. The
algorithms used to produce the basemap are C-Value with Unigrams
ranking, Partial Match Jaccard Coefficient similarity, and
Pull Lesser Terms filtering, with the number of terms parameter
set to 1,100.

Effective heatmap coverage is a function of both the number of
available documents being plotted and how well terms in the heatmap
query set are represented in the basemap. Comparing the TVCG and
ICWS heatmaps to the CVPR and STOC heatmaps demonstrates this
relationship between document availability and heatmap coverage.
Fewer papers are available for TVCG and ICWS in DBLP, so these
venues have less representation in the basemap (which is constructed
from documents randomly sampled from all documents in DBLP), and
their heatmaps therefore cover less area.

4.3 Temporal Heatmap Overlays 

Specifying different date ranges for heatmap queries allows the 
generation of maps that show how areas of research have spread 
across the topic basemaps over time. The maps in Fig. [7] show how 
terms in the titles of papers published in the Journal of the ACM 
(JACM) have shifted over the past six decades, starting in 1954. 
The heatmap for papers from 1954-1963 has high intensity values 
over terms dealing with numerical and matrix methods. Compu- 
tational complexity grows in intensity in the 1964-1973 map, and 
complexity and algorithmic bounds outpace numerical methods in 
1974-1983. The algorithmic bound terms remain consistently intense
throughout the remaining decades. A clear trend is that the focus
of the journal has narrowed over time: in




(a) Heatmap for CVPR made from 3,665 documents 



(b) Heatmap for STOC made from 2,685 documents 




(c) Heatmap for ICWS made from 1,288 documents 



(d) Heatmap for TVCG made from 1,826 documents 



Figure 6: Conference and journal heatmaps overlaid on a map generated from 70,000 paper titles, sampled uniformly from all available DBLP
papers. Map generation algorithms are Multi-Word terms for term extraction, C-Value with Unigrams for term ranking, Partial Match Jaccard
Coefficient for similarity, and Pull Lesser Terms for filtering, with the number of top terms parameter set to 1500.



the first four decades the topics are all over the map, but in the last 
decade the topics are concentrated around complexity, algorithms, 
and bounds. 
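
The per-decade target queries behind such temporal overlays can be
sketched as follows. This is a hedged illustration rather than the
system's actual query code: the function name
sample_titles_by_decade, the (title, year) input format, and the
fixed random seed are assumptions, while the 1954 starting year and
the cap of 400 titles per decade follow the JACM example in Fig. 7.

```python
import random
from collections import defaultdict


def sample_titles_by_decade(papers, start_year=1954, span=10,
                            max_per_bucket=400, seed=0):
    """Group (title, year) pairs into year ranges and cap each range.

    Returns a dict mapping (first_year, last_year) to a list of
    titles, with at most max_per_bucket titles sampled uniformly.
    """
    buckets = defaultdict(list)
    for title, year in papers:
        if year < start_year:
            continue  # outside the period being visualized
        first = start_year + ((year - start_year) // span) * span
        buckets[(first, first + span - 1)].append(title)

    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    sampled = {}
    for key, titles in sorted(buckets.items()):
        if len(titles) > max_per_bucket:
            titles = rng.sample(titles, max_per_bucket)
        sampled[key] = titles
    return sampled
```

Each returned list of titles would then serve as one heatmap query
over the same basemap, giving one overlay per decade.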

4.4 Individual Paper Heatmaps 

To construct a heatmap visualization of the topics in a single paper, 
we can run the same term extraction algorithms outlined above on 
the abstract or full body text of a paper. This heatmap is then over- 
laid on a basemap constructed from DBLP paper titles, as above. 
Figure [8] shows a heatmap constructed from terms in the abstract of
this paper, over a single-word basemap of TVCG paper titles.

5 Implementation 

5.1 Modularity 

The system is built with a modular design to accommodate future 
incorporation of additional natural language processing algorithms. 
Each of the stages of the map generation pipeline (ranking, similar- 
ity, and filtering) is handled by a separate module of code. Within 



each module, the functions that perform the module's task are
designed to be substitutable, accepting standardized input and
producing standardized output. We
plan to expand the system's capabilities by testing the ability of 
other ranking, similarity, and filtering algorithms to produce maps 
that provide a better visual representation of the underlying topic 
space. Source code for the system is available for others who wish 
to experiment with algorithms of their own. 
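
As a rough illustration of this substitutable design, the sketch
below wires three interchangeable functions into one pipeline. It is
not the MoCS code: the names build_term_graph, tf_ranking,
cooccurrence_jaccard, and threshold_filter are hypothetical, and the
three functions are deliberately simple stand-ins rather than the
C-Value, Partial Match Jaccard Coefficient, or Pull Lesser Terms
algorithms used for the maps in this paper. Each document is assumed
to be a list of already extracted terms, and the code targets
Python 3.

```python
from collections import Counter
from itertools import combinations


def tf_ranking(docs):
    """Rank terms by total frequency across all documents."""
    counts = Counter(term for doc in docs for term in doc)
    return [term for term, _ in counts.most_common()]


def cooccurrence_jaccard(terms, docs):
    """Jaccard coefficient over the sets of documents containing each term."""
    occurs = {t: {i for i, doc in enumerate(docs) if t in doc} for t in terms}
    sims = {}
    for a, b in combinations(terms, 2):
        union = occurs[a] | occurs[b]
        sims[(a, b)] = len(occurs[a] & occurs[b]) / len(union) if union else 0.0
    return sims


def threshold_filter(sims, min_sim=0.1):
    """Keep only term pairs whose similarity reaches min_sim."""
    return {pair: s for pair, s in sims.items() if s >= min_sim}


def build_term_graph(docs, rank=tf_ranking, similarity=cooccurrence_jaccard,
                     edge_filter=threshold_filter, num_terms=100):
    """Run ranking, similarity, and filtering with swappable stages."""
    terms = rank(docs)[:num_terms]
    return terms, edge_filter(similarity(terms, docs))
```

Because each stage only depends on the shape of its input and
output, a different ranking or filtering function can be passed as
an argument without touching the rest of the pipeline, for example
build_term_graph(docs, edge_filter=lambda sims: sims) to disable
filtering.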

5.2 Database 

Paper titles and meta-information are stored in a SQL database, 
containing entries for 2,184,270 papers, journal articles, conference 
proceedings, theses, and books. This bibliographic information is 
parsed from an XML dump of DBLP entries, containing author, 
conference or journal, and date meta-information for each paper ti- 
tle [31]. There are over one million personal web pages listed in 
DBLP, with title "Home Page", and tag information is used to ex- 
clude these. Additionally, as DBLP contains papers with titles in 
several languages, an effort is made to detect the language of each 
title, using a trigram character classifier. Titles classified as English 




(a) Heatmap for 1954-1963 made from 399 paper titles 



(b) Heatmap for 1964-1973 made from 400 paper titles 




(c) Heatmap for 1974-1983 made from 400 paper titles 



(d) Heatmap for 1984-1993 made from 400 paper titles 





(e) Heatmap for 1994-2003 made from 372 paper titles 



(f) Heatmap for 2004-2013 made from 284 paper titles 



Figure 7: Heatmaps of six decades of papers from Journal of the ACM (JACM). Basemap is generated from multi-word terms extracted from
the titles of 1,998 papers published in JACM, using the C-Value with Unigrams ranking, Partial Match Jaccard Coefficient similarity, and
Pull Lesser Terms filtering functions. A maximum of 400 paper titles were sampled from the JACM's publications for each decade.




Figure 8: A heatmap from the abstract of this paper, over a basemap 
from 1,343 TVCG paper titles, using the TF ranking, LSA similar- 
ity, and Pull Lesser Terms filtering functions, with the number of 
terms parameter set to 1700. 



by this classifier are marked in the database, and only these titles are 
currently used in the map generation. Each paper is associated with 
its author and journal or conference if this information is available 
in DBLP. The database contains records for 1,324 journals, 6,904 
conferences, and 1,237,445 authors which can be used to filter doc- 
ument title queries for map construction. 
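
The language detection step can be illustrated with a toy character
trigram classifier. The sketch below is an assumption-laden example,
not the classifier used to build the database: the function names,
the padding scheme, the scoring rule (a dot product of trigram
counts, normalized by profile size), and the tiny training samples
are all chosen for illustration. A practical classifier would be
trained on far more text per language.

```python
from collections import Counter


def char_trigrams(text):
    """Count character trigrams of a lowercased, space-padded string."""
    padded = "  " + text.lower() + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def train_profiles(samples):
    """Build one trigram count profile per language.

    samples: dict mapping a language label to a list of example texts.
    """
    profiles = {}
    for lang, texts in samples.items():
        profile = Counter()
        for text in texts:
            profile.update(char_trigrams(text))
        profiles[lang] = profile
    return profiles


def classify(title, profiles):
    """Return the language whose profile best matches the title."""
    grams = char_trigrams(title)

    def score(profile):
        total = sum(profile.values())
        # Trigrams missing from a profile contribute zero.
        return sum(n * profile[g] for g, n in grams.items()) / total

    return max(profiles, key=lambda lang: score(profiles[lang]))


profiles = train_profiles({
    "en": ["a survey of graph drawing algorithms",
           "the analysis of distributed systems"],
    "de": ["eine Untersuchung der verteilten Systeme",
           "zur Theorie der Graphen und Algorithmen"],
})
```

With these profiles, classify("analysis of graph systems", profiles)
selects "en" and classify("Untersuchung der Graphen", profiles)
selects "de"; titles labeled as English would then be the ones kept
for map generation.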

5.3 Server 

The system is implemented using Python 2.7. Full source code (for
map making, the DBLP database interface, and the web server) is
available at github.com/dpfried/mocs. Natural language
processing code for term extraction is implemented using utilities
from the NLTK [3] library. The NumPy and SciPy [27] numerical
computation libraries are used for implementing the similarity
functions and ranking algorithms. The web server is implemented with
Django, using Celery as a back-end task manager and SQLAlchemy as
the database interface. Maps are displayed in the user's browser
using the SVG rendering capabilities of AT&T's GraphViz system [13].
These SVG elements are rendered in a zoomable and pannable con- 
tainer provided by the open source OpenLayers JavaScript display 
library 1 1 1. Heatmaps are overlaid with the heatmap plugin in Open- 
Layers, together with additional JavaScript that calculates term po- 
sitions and SVG coordinate transforms, in order to correctly posi- 
tion the heatmap over the basemap when zooming. 

6 Conclusions and Future Work 

In this paper we presented a practical approach for visualizing
large-scale bibliographic data via natural language processing and
a geographic map metaphor. We described the MoCS system
in the context of the DBLP bibliography server and demonstrated 
several possible explorative visualization uses of the system. The 
novel aspects of the system include modifications to natural lan- 
guage processing techniques (allowing us to work with only titles 
of research papers), the ability to combine arbitrary basemaps with 
heatmap overlays (showing temporal evolution, or profiles of con- 
ferences and journals), and the modularity and availability of the 
interactive visualization system (making it possible to experiment 
with different approaches to various subproblems). An interactive 



interface to the system, and a video of the system in action, are
available at mocs.cs.arizona.edu.

There are likely more possible uses of such a visualization sys- 
tem. For example, many journals (e.g., Cell, Earth and Planetary 
Science, Molecular Phylogenomics and Evolution) have recently 
added requirements for graphical abstracts as a part of research
papers. These are single-panel images designed to give readers an
immediate understanding of the take-home message of the paper. 
MoCS can be used to generate graphical abstracts using a basemap 
from the journal and heatmap of the submission. 

We would have liked to compare the performance of our system
against earlier and related approaches. However, this is nearly
impossible, as very few such systems are fully functional online or
provide source code. We contacted the authors of a dozen earlier
semantic word-cloud or spatialization based systems but none were 
able to share source code or executables. 

While ours is indeed a functional system, and it does offer vari- 
ous options for the natural language processing step, for the gener- 
ation of the graph, and for the final map rendering, there are many 
possible future directions: 

1. We would like to experimentally verify whether our maps
based on research paper titles correspond to what experts in
the field expect to see. If not, topic models for term extraction
and similarity that incorporate lexical priors, or words that are
of specific interest, can be used [26]. Thus, we could specify
"seed words" or "start words" that must be included in the map.

2. How much additional information and precision can be gained 
from abstracts compared to just titles of papers? Similarly, 
what is the additional information gain when going from ab- 
stracts to entire papers? 

3. We can study departmental, state-wide, and even country-wide
profiles over the base map of CS. This would hopefully
allow us to visually compare and contrast the type of research 
done in different universities, states, and countries. 

4. Automatically labeling countries on the map could be accom- 
plished by looking for the most frequent conferences and jour- 
nals with topics in a particular country, and extracting the top 
2-3 relevant terms. 

5. Statistical methods for multi-word term extraction and ranking,
such as topical n-grams [46] or LexRank [14], may allow
us to produce terms that are more representative of topics in 
the document titles than the terms extracted through POS tag- 
ging and pattern matching alone. 

6. Only terms that appear in both the basemap and heatmap 
queries are currently displayed in heatmaps. To create 
heatmaps that also cover related terms in the map, the pair- 
wise term similarity values could be used to diffuse heatmap 
intensity onto terms that were unseen in the heatmap query, 
but that are similar to those seen in the query. 

7. The graph embedding and graph clustering combinations that 
are available in GMap often result in fragmented maps. We 
would like to expand the functionality of GMap by providing 
cluster-based (and thus non-fragmented) embedding. 

8. The methodology described here is not limited to computer 
science research papers. It should be possible to generalize to 
other research areas, starting with physics (due to ArXiv) and 
medicine (PubMed). 
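
As an illustration of item 6 above, the following sketch shows one
possible way to diffuse heatmap intensity onto basemap terms that do
not occur in the heatmap query. The rule used here (an unseen term
receives a damped copy of its best similarity-weighted neighbor) and
the names diffuse_intensities and damping are assumptions made for
the example; this is a proposal sketch, not a feature of the current
system.

```python
def diffuse_intensities(intensities, similarity, basemap_terms, damping=0.5):
    """Spread intensity from seen terms onto similar unseen terms.

    intensities: dict mapping seen term -> normalized intensity.
    similarity: dict mapping (term_a, term_b) -> similarity in [0, 1];
                pairs may be stored in either order.
    basemap_terms: all terms drawn on the basemap.
    damping: factor below 1 so diffused values stay below their source.
    """
    diffused = dict(intensities)
    for term in basemap_terms:
        if term in intensities:
            continue  # already highlighted by the heatmap query
        best = 0.0
        for seen, value in intensities.items():
            sim = similarity.get((term, seen),
                                 similarity.get((seen, term), 0.0))
            best = max(best, sim * value)
        if best > 0.0:
            diffused[term] = damping * best
    return diffused
```

Terms with no similarity to any seen term stay out of the result, so
the overlay still reflects the target query rather than covering the
whole map.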